Imagine that we are some fancy data scientists exploring - once again - the Gapminder data. We are particularly interested in the development of the GDP across time and across countries. Some R-fanatics from GESIS suggested that we use this tidyverse thing to complete our tasks. They also told us that we do not always need to load all of its packages at once.

1

Load the tidyverse packages needed for importing Excel data and for data wrangling.
You can find the names of the required packages in the slides and exercises for sessions A3 (Tidy Data), A4 (Importing Data), and A5 (Data Wrangling 1).
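As a rough sketch of loading individual packages instead of the whole tidyverse: the selection below (readxl, tidyr, dplyr) is an assumption based on the functions mentioned in this exercise, so please check it against the session slides.

```r
# Assumed selection of packages; the slides are the authoritative source.
library(readxl)  # read_excel() for importing Excel files
library(tidyr)   # gather() for reshaping wide data to long data
library(dplyr)   # select(), slice(), rename(), filter(), arrange(), ...
```

Note that readxl is installed together with the tidyverse but is not attached by library(tidyverse), so it has to be loaded explicitly either way.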

Ok, that wasn’t too hard. But data science is about data, so we have to load the data we are interested in.

2

Import the GDP data from Gapminder as gap_gdp. Make sure to only import the Excel sheet named “Data”.
Individual sheets can be chosen with the argument sheet = "name_of_your_sheet".
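The sheet argument can be tried out without the Gapminder file. The sketch below uses an example workbook that ships with readxl; the object name cars and the commented-out file path are placeholders, not part of the exercise.

```r
library(readxl)

# Example workbook bundled with readxl (not the Gapminder data).
path <- readxl_example("datasets.xlsx")

# List the sheet names to see what can be passed to `sheet`.
excel_sheets(path)

# Import only the sheet called "mtcars".
cars <- read_excel(path, sheet = "mtcars")
dim(cars)

# For the exercise the call would have the same shape; the path is a placeholder:
# gap_gdp <- read_excel("path/to/your_file.xlsx", sheet = "Data")
```

Calling dim() on the imported object is a quick way to compare against the expected dimensions.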

Have the data been successfully imported? They should be in a tibble with the dimensions 275 x 53. As a further check: The income per person for Algeria for the years 1960, 1961, and 1962 should be 1280, 1085, and 856.

3

Verify that the income per person for Algeria in the years 1960, 1961, and 1962 is 1280, 1085, and 856.
Algeria is in the 5th row of the dataset, and the relevant variables are in the first four columns. You can also subset datasets by selecting columns by number with select() and by picking rows by number with slice().
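Selecting by position can be sketched on a small made-up tibble. All names and values below are invented for illustration; only the pattern select() followed by slice() carries over to the exercise.

```r
library(dplyr)
library(tibble)

# Toy data: one identifier column and a few year columns (values are made up).
toy <- tibble(
  unit   = c("A", "B", "C"),
  `1960` = c(10, 20, 30),
  `1961` = c(11, 21, 31),
  `1962` = c(12, 22, 32)
)

# Keep the first three columns, then pick the second row.
picked <- toy %>%
  select(1:3) %>%
  slice(2)

picked
```

For the exercise, the column range and the row number would be replaced by the positions given in the hint.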

Let’s say that we are interested in the earliest 10 years as well as the most recent 10 years that appear in the dataset. If we want to aggregate the data per year, they should ideally be in long format.

4

Re-arrange the data such that they are in the long format.
Remember that the command for converting wide format data to long format is gather(). Additionally, you might want to create a more convenient column name for the variable Income per person (fixed 2000 US$) with rename() (GDP might be a good choice here) and change the variable type for year to integer.
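The reshaping step might look roughly like this on toy data. The column names (including the deliberately awkward one) and the values are invented, so which column to rename in the real data, and to what, still has to be decided by looking at gap_gdp.

```r
library(tidyr)
library(dplyr)
library(tibble)

# Toy wide data: one awkwardly named identifier column plus year columns.
toy_wide <- tibble(
  `A long, awkward name` = c("A", "B"),
  `1960` = c(10, NA),
  `1961` = c(11, 21)
)

toy_long <- toy_wide %>%
  # Stack all year columns into a key column (year) and a value column (gdp).
  gather(key = "year", value = "gdp", -`A long, awkward name`) %>%
  # Give the awkward column a shorter name.
  rename(unit = `A long, awkward name`) %>%
  # gather() stores the former column names as character; convert to integer.
  mutate(year = as.integer(year))

toy_long
```

In tidyr 1.0.0 and later, pivot_longer() is the recommended successor to gather(); gather() still works and is used here because the course material refers to it.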

There are still a lot of missing values we might want to get rid of, and the data are not arranged in a way that is ideal to explore changes over time. For the next tasks, simply re-use the previous code and add the following commands with %>%.

5

Remove all missing values and arrange the data in ascending order of years and GDP.
There are several ways to exclude missing values. The most convenient one is to use filter() in combination with !is.na().
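A minimal sketch of dropping missing values and sorting, again on invented toy data with hypothetical column names (unit, year, gdp):

```r
library(dplyr)
library(tibble)

# Toy long data with one missing value (all values are made up).
toy_long <- tibble(
  unit = c("A", "B", "A", "B"),
  year = c(1961L, 1960L, 1960L, 1961L),
  gdp  = c(11, NA, 10, 21)
)

toy_clean <- toy_long %>%
  # Keep only rows where gdp is not missing.
  filter(!is.na(gdp)) %>%
  # Sort by year first, then by gdp within each year (both ascending).
  arrange(year, gdp)

toy_clean
```

arrange() sorts in ascending order by default; wrapping a column in desc() would reverse the order for that column.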

Now we have a - more or less - clean dataset for our actual task: calculating the mean values across all countries for each of the first ten years and each of the last ten years. What’s still a little bit distracting is that we have the values for all years between these two periods in the data. However, we might want to use some of these data points in future analyses. Hence, we will do all analyses ‘on the fly’ (i.e., without creating a new dataset). Let’s start with the first period.

6

Calculate the mean value of GDP across all countries for each of the first ten years in the dataset.
As the year variable is an integer, you can simply filter the range of years you are interested in. The first year in the dataset is 1960.
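One possible way to compute a mean per year for a range of years is to filter, group, and summarise. The sketch below uses toy data and a two-year window; the column names and values are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Toy long data: two units observed in three years (values are made up).
toy <- tibble(
  unit = rep(c("A", "B"), times = 3),
  year = rep(c(1960L, 1961L, 1962L), each = 2),
  gdp  = c(10, 20, 12, 22, 14, 24)
)

first_years <- toy %>%
  # Keep only the years inside the window of interest.
  filter(year >= 1960, year <= 1961) %>%
  # One group per year, then one mean per group.
  group_by(year) %>%
  summarise(mean_gdp = mean(gdp))

first_years
```

With a first year of 1960, a ten-year window would run from 1960 to 1969 inclusive.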

Now it should be easy to do the same for the 10 most recent years in the dataset…

7

Calculate the mean value of GDP across all countries for each of the last ten years.
The most recent year in the dataset is 2011.
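For the most recent years, the window can also be derived from the data instead of being typed in. This is a sketch on toy data with a two-year window and hypothetical column names; between() from dplyr is used as one alternative to two separate comparisons.

```r
library(dplyr)
library(tibble)

# Toy long data: two units observed in three years (values are made up).
toy <- tibble(
  unit = rep(c("A", "B"), times = 3),
  year = rep(2009:2011, each = 2),
  gdp  = c(10, 20, 12, 22, 14, 24)
)

# Most recent year found in the data.
last_year <- max(toy$year)

last_years <- toy %>%
  # between() includes both bounds, so this keeps the last two years.
  filter(between(year, last_year - 1L, last_year)) %>%
  group_by(year) %>%
  summarise(mean_gdp = mean(gdp))

last_years
```

With a most recent year of 2011, a ten-year window would start at last_year - 9L, that is, 2002.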